getwd()
## [1] "/home/user3/workshop"

1. Load input data

Raw data from Pezzini et al 2017 was subjected to differential gene expression analysis with Degust and the results file saved to Pezzini_DE.txt.

The input data file is within the current working directory so we do not need to specify its directory path.

data <- read_tsv("Pezzini_DE.txt", col_names = TRUE, show_col_types = FALSE)
head(data)

The dataframe shows genes with fold change and FDR values, along with some normalised counts values for the 6 samples (2 groups with 3 replicates each).

Look on the environment pane of RStudio, and you can see a description ‘14420 obs. of 10 variables’ - this shows your dataframe consists of 10 columns and 14,420 genes.

2. Get ORA gene list

Now we need to filter for differentially expressed genes (DEGs), and we will apply the thresholds adjusted P values/FDR < 0.01, and log2fold change of 2.

degs <- data %>%
  filter(FDR < 0.01 & abs(Log2FC) > 2) %>%
  pull(Gene.ID)
cat("Number of genes passing FDR and fold change filter:", length(degs), "\n")
## Number of genes passing FDR and fold change filter: 792
# Save the DEG gene list to disk: 
write.table(degs, "Pezzini_DEGs.txt", row.names = FALSE, col.names = FALSE, quote = FALSE)

We have 792 genes passing our filters.

3. Get background gene list

Recall from the webinar and day 1 of the workshop that an experimental background gene list is crucial to avoiding false positives and minimising tissue bias with ORA.

The analysis in Degust has already removed lowly expressed genes, so we can simply extract all genes from this data matrix as our background gene list and save it as our ‘background’ object, as well as save to disk so that we can include it within the supplementary materials of any resultant publications for reproducibility.

background <- data$Gene.ID
cat("Number of background genes:", length(background), "\n")
## Number of background genes: 14420
# Save the background gene list to disk:
write.table(background, "Pezzini_background_genes.txt", row.names = FALSE, col.names = FALSE, quote = FALSE)

4. Run ORA with gost function

Before running the below code chunk, review the parameters. Observe the similarities to what was available on the g:Profiler web interface, for example organism, the correction method (g:Profiler’s custom g_scs method), and domain scope (background genes).

Every R package has a user guide that outlines available parameters. Open the gprofiler2 user guide and navigate to page detailing the gost function (currently page 6).

Note: to open this link for the RStudio training VM, highlightthe link, right click, then select “Go to https://cran…”

Under the Usage subheading, all available parameters are listed, along with their default settings. You can over-ride these parameters by specifying your custom values in your R command.

Run the below code which explicitly includes all available gost parameters:

ora <- gost( 
    degs, 
    organism = "hsapiens", 
    ordered_query = FALSE, 
    multi_query = FALSE, 
    significant = TRUE, 
    exclude_iea = FALSE, 
    measure_underrepresentation = FALSE, 
    evcodes = FALSE, 
    user_threshold = 0.05, 
    correction_method = "g_SCS", 
    domain_scope = "custom_annotated", 
    custom_bg = background, 
    numeric_ns = "", 
    sources = NULL, 
    as_short_link = FALSE, 
    highlight = FALSE
)

An error-free gost run should produce no console output. Our results are saved in the R object ora.

Since we are using many of the default parameters, we could shorten this code to what is shown below. The results would be identical, however not as transparent as far as easily identifying what parameters were applied to a run.

Tip: you can run the command formals(gost) to print out all parameters and their default values. This is a generic R function so can be used on other tools outside of gprofiler2.

#ora <- gost( 
#    degs, 
#    correction_method = "g_SCS", 
#    domain_scope = "custom_annotated",
#    custom_bg = background,
#)
formals(gost)
## $query
## 
## 
## $organism
## [1] "hsapiens"
## 
## $ordered_query
## [1] FALSE
## 
## $multi_query
## [1] FALSE
## 
## $significant
## [1] TRUE
## 
## $exclude_iea
## [1] FALSE
## 
## $measure_underrepresentation
## [1] FALSE
## 
## $evcodes
## [1] FALSE
## 
## $user_threshold
## [1] 0.05
## 
## $correction_method
## c("g_SCS", "bonferroni", "fdr", "false_discovery_rate", "gSCS", 
##     "analytical")
## 
## $domain_scope
## c("annotated", "known", "custom", "custom_annotated")
## 
## $custom_bg
## NULL
## 
## $numeric_ns
## [1] ""
## 
## $sources
## NULL
## 
## $as_short_link
## [1] FALSE
## 
## $highlight
## [1] FALSE

View the top-most significant enrichments with the R head command. Only significant enrichments passing your specified threshold (adjusted P value < 0.05) are included in the results object.

head(ora$result)

Let’s give our query a name:

ora$result$query <- 'DEGs_Padj0.05_FC2'
head(ora$result)

Use the small black arrow near the column names to view columns wider than the page width. The head view only shows the top 6 enrichments, which are all GO Biological Process.

We can obtain a list of all databases with significant enrichments from our query list:

unique(ora$result$source)
##  [1] "GO:BP" "GO:CC" "GO:MF" "HP"    "HPA"   "KEGG"  "MIRNA" "REAC"  "TF"   
## [10] "WP"

Same as the web tool, we have enrichment results for GP (BP, CC, MF), HP (human phenotype), HPA (human protein atlas), KEGG, MiRNA, Reactome, Transcription Factors, and WikiPathways.

5. Save the results to a file

Let’s save the results file to disk. First, we will re-order the columns so the output more closely matches the tables that are downloaded from the web version of the tool.

ora_reordered <- ora$result[, c("source", "term_name", "term_id", "p_value", "term_size", "query_size", "intersection_size", "effective_domain_size")]

head(ora_reordered)
write.csv(ora_reordered, "gprofiler_ORA_results.csv", row.names = FALSE)

6. Visualise the results

Manhattan plots

The gostplot function creates a Manhattan plot similar to the one shown on the web tool. By setting interactive=TRUE we can hover over the data points to see enriched term details.

The parameter capped = TRUE is an indicator whether the -log10(p-values) would be capped at 16 if bigger than 16. This fixes the scale of y-axis to keep Manhattan plots from different queries comparable and is also intuitive as, statistically, p-values smaller than that can all be summarised as highly significant.

gostplot(ora, 
  capped = TRUE, 
  interactive = TRUE, 
  pal = c(`GO:MF` = "#dc3912", 
    `GO:BP` = "#ff9900", 
    `GO:CC` = "#109618", 
    KEGG = "#dd4477", 
    REAC = "#3366cc", 
    WP = "#0099c6", 
    TF = "#5574a6", 
    MIRNA = "#22aa99", 
    HPA = "#6633cc", 
    CORUM = "#66aa00", 
    HP = "#990099") 
)

There are a lot of significant enrichments for GO biological processes. Many of these are probably terms containing a large number of genes, so not particularly informative. Other R tools have default settings limiting the minimum and maximum number of genes in a geneset to be included in the analysis. Since there is no direct parameter to restrict term size to gostplot, we can filter the ORA results before plotting. Let’s apply a maximum gene set size of 500, and a minimum gene set size of 10, which are the default setting used by clusterProfiler.

# Filter the results for GO:BP terms with term_size <= 500
ora_filter_termsize <- ora
ora_filter_termsize$result <- ora$result %>% filter(term_size <= 500) %>% filter(term_size >= 10)

# Plot with gostplot using the filtered results
gostplot(ora_filter_termsize, 
  capped = TRUE, 
  interactive = TRUE, 
  pal = c(
    `GO:MF` = "#dc3912", 
    `GO:BP` = "#ff9900", 
    `GO:CC` = "#109618", 
    KEGG = "#dd4477", 
    REAC = "#3366cc", 
    WP = "#0099c6", 
    TF = "#5574a6", 
    MIRNA = "#22aa99", 
    HPA = "#6633cc", 
    CORUM = "#66aa00", 
    HP = "#990099"
  )
)

This has cleaned up ‘Biological Process’ a little bit, enabling signals of more specific terms to be highlighted.

gprofiler2 includes a function for creating a publication-ready image that can optionally highlight specific terms. We need to first produce a plot with interactice = FALSE, save it to an object, and then provide that plot object to the publish_gostplot function.

# Plot with gostplot using the filtered results
plot <- gostplot(ora_filter_termsize, 
  capped = TRUE, 
  interactive = FALSE, 
  pal = c(
    `GO:MF` = "#dc3912", 
    `GO:BP` = "#ff9900", 
    `GO:CC` = "#109618", 
    KEGG = "#dd4477", 
    REAC = "#3366cc", 
    WP = "#0099c6", 
    TF = "#5574a6", 
    MIRNA = "#22aa99", 
    HPA = "#6633cc", 
    CORUM = "#66aa00", 
    HP = "#990099"
  )
)

The publish_gostplot parameter highlight_terms enables you to highlight specific terms on the plot, with a table showing enrichment details below for those highlighted terms.

Let’s highlight some selected terms manually. You need to provide the term ID not term name.

#Term IDs for 'Collagen degradation' and 'Collagen formation'
highlight <- c("REAC:R-HSA-1442490", "REAC:R-HSA-1474290")

filename <- "gostplot_filter_termsize_collagen.png"

publish_gostplot(plot, 
  highlight_terms = highlight, 
  filename, 
  width = 10, 
  height = 10 )
## The image is saved to gostplot_filter_termsize_collagen.png

You can use R grepl function to search for terms with names matching some keyword. Let’s highlight all terms related to receptors. The code chunk applies an increased figure height, to ensure we can see the whole plot within the notebook.

# Filter terms containing "receptor" keyword and create a list of those term IDs
highlight <- ora$result %>% 
  filter(grepl("receptor", term_name, ignore.case = TRUE)) %>% 
  pull(term_id) 

filename <- "gostplot_filter_termsize_receptors.png"

publish_gostplot(plot, 
  highlight_terms = highlight, 
  filename, 
  width = 10, 
  height = 10 )
## The image is saved to gostplot_filter_termsize_receptors.png

Dotplots

One of the advantages of working in R is flexibility with visualisations. While the interactive Manhattan plots and publish_gostplot options are nice, it can also be useful to visualise P values aginst all term descriptions.

One way to do this is with a dotplot. We can loop through all databases that gost has enriched against, and use the R package ggplot2 to make a dotplot, first filtering the results by database and(if desired) term size:

# List of databases
dbs <- unique(ora$result$source)

# Loop through each database and create a separate plo for each
for (db in dbs) {
  #db_results <- ora$result 
  db_results <- ora$result %>% filter(source == db, term_size >= 10, term_size <= 500)
  
  # Create a dot plot for the current database
  p <- ggplot(db_results, aes(x = reorder(term_name, -p_value), y = -log10(p_value))) +
    geom_point(aes(size = term_size, color = significant)) +
    labs(title = paste(db), 
         x = "Term", 
         y = "-log10(p-value)") +
    theme_minimal() +
    coord_flip()  # Flips the coordinates for better visibility
  
  # Print the plot
  print(p)
}

For plots with a lot of enriched terms, such as GO Biological Process, the display within the notebook is less than ideal. Saving the plot to an image file enables better resolution:

# Filter for the GO:BP database
go_bp_results <- ora$result %>% filter(source == "GO:BP",term_size >= 10, term_size <= 500)

# Create the plot for GO:BP
p_go_bp <- ggplot(go_bp_results, aes(x = reorder(term_name, -p_value), y = -log10(p_value))) +
  geom_point(aes(size = term_size, color = significant)) +
  labs(title = "GO:BP", 
       x = "Term", 
       y = "-log10(p-value)") +
  theme_minimal() +
  coord_flip()  # Flips the coordinates for better visibility


# Open a PDF device to save the plot as a full size A4: 
pdf("gprofiler_GO_BP_dotplot.pdf", width = 8.27, height = 11.69)  # A4 portrait size in inches

# Print the plot to the device
print(p_go_bp)

# Close the device (this saves the plot)
dev.off()
## png 
##   2

Open this plot by clicking it from the ‘Files’ pane of RStudio. Notice how the term names are now readable :-)

7. Run a gost multi-query for up-regulated and down-regulated genes, and compare the results to those generated from the full DEG gene list

By providing more than one gene list and setting multi_query = TRUE, results from all of the gene lists are grouped by term IDs for easier comparison. This can be handy when you have multiple comparisons within an experiment, or when you want to investigate enrichments within the up and down regulated genes separately.

First, we need to re-extract our gene list so we have separate DEGs for up-regulated and down-regulated genes.

up_degs <- data %>%
  filter(FDR < 0.01 & Log2FC >= 2) %>%
  pull(Gene.ID)
cat("Number of upregulated DEGs:", length(up_degs), "\n")
## Number of upregulated DEGs: 577
# Save the DEG gene list to disk: 
write.table(up_degs, "up_DEGs.txt", row.names = FALSE, col.names = FALSE, quote = FALSE)

down_degs <- data %>%
  filter(FDR < 0.01 & Log2FC <= -2) %>%
  pull(Gene.ID)
cat("Number of downregulated DEGs:", length(down_degs), "\n")
## Number of downregulated DEGs: 215
# Save the DEG gene list to disk: 
write.table(down_degs, "down_DEGs.txt", row.names = FALSE, col.names = FALSE, quote = FALSE)

Now run gost as multi-query. The changes required for multi-query are providing a list of gene list objects to the query parameter instead of a single gene list object, and setting multi_query = TRUE, which is FALSE by default. By including all genes as well, we can efficintly compare up vs down vs no separation.

ora_multi <- gost( 
    query = list("upregulated" = up_degs, "downregulated" = down_degs, "all_DEGs" = degs), 
    organism = "hsapiens", 
    ordered_query = FALSE, 
    multi_query = TRUE, 
    significant = TRUE, 
    exclude_iea = FALSE, 
    measure_underrepresentation = FALSE, 
    evcodes = FALSE, 
    user_threshold = 0.05, 
    correction_method = "g_SCS", 
    domain_scope = "custom_annotated", 
    custom_bg = background, 
    numeric_ns = "", 
    sources = NULL, 
    as_short_link = FALSE, 
    highlight = FALSE 
)

Now create a multi-query interactive Manhattan plot with gostplot:

gostplot(ora_multi, capped = TRUE, interactive = TRUE)

Unfortunately, notebook view squashes the top plot over the bottom one, and adjusting figure height or plot layout options doesn’t seem to help. Plotting as non-interactive or plotting from the console to the plots pane both produce a correct looking plot.

p <- gostplot(ora_multi, capped = TRUE, interactive = FALSE)
filename <- "gprofiler_ORA_multiquery.png"

publish_gostplot(p, 
  highlight_terms = NULL, 
  filename, 
  width = 10, 
  height = 10 )
## The image is saved to gprofiler_ORA_multiquery.png

To access the tabular results separately, they need to be split, as the default results view

NOT WORKING

NR <- nrow(ora$result)
NS <- ifelse(NR < 10, NR, 10)
termSource <- sample(nrow(ora$result), NS)
VEC <- ora$result[termSource, "term_id"]

publish_gosttable(ora,
  highlight_terms = VEC,
  use_colors = TRUE,
  show_columns = c("source", "term_name", "term_size", "intersection_size"),
  filename = NULL)

Head ora multi in notebook:

head(ora_multi$result)
NR <- nrow(ora_multi$result)
NS <- ifelse(NR < 10, NR, 10)
termSource <- sample(nrow(ora_multi$result), NS)
VEC <- ora_multi$result[termSource, "term_id"]

filename <- "publish-gosttable-multi.pdf"

publish_gosttable(ora,
  highlight_terms = VEC,
  use_colors = TRUE,
  show_columns = c("source", "term_name", "term_size", "intersection_size", "p_values", "significant"),
  filename)
## The image is saved to publish-gosttable-multi.pdf

END NOT WORKING

8. Compare gprofiler2 R results to the g:Profiler web results

In day 1 of the workshop, you ran ORA with g:Profiler web tool and saved the results to a CSV. Let’s compare the results to those we have generated in R. Do we expect the results to be identical or differ slightly?

The input file here is one that we have created, but should match yours as long as you used the same P filters and gprofiler parameters.

web <- read.csv("gProfiler_hsapiens_07-11-2024_11-27-09__intersections.csv")

Check the numbers: are there any terms significant frm one tool but not the other?

# Extract significant term names
web_terms <- web$term_name
ora_terms <- ora$result$term_name

paste0("Number of significant terms from web: ", length(web_terms))
## [1] "Number of significant terms from web: 273"
paste0("Number of significant terms from R: ", length(ora_terms))
## [1] "Number of significant terms from R: 273"
# Find command and unique terms
common_terms <- intersect(web_terms, ora_terms)
if (length(common_terms) == length(web_terms) && length(common_terms) == length(ora_terms)) {
  # If the lengths match, all terms are shared
  print("All terms are shared")
} else {
  # If there are differences, report the number of terms
  unique_web <- setdiff(web_terms, ora_terms)
  unique_ora <- setdiff(ora_terms, web_terms)
  
  print(paste("Number of terms unique to web:", length(unique_web)))
  print(paste("Number of terms unique to gprofiler2 (R):", length(unique_ora)))
}
## [1] "All terms are shared"

That’s a good start! Do the P values differ? Let’s look closely at GO ‘Molecular Function’.

# Filter for GO:MF terms
go_mf_web <- web %>% filter(source == "GO:MF")
go_mf_r <- ora$result %>% filter(source == "GO:MF")

# Extract term names and p values
comparison_data_go_mf <- data.frame(
  term_name = go_mf_web$term_name,
  p_value_web = go_mf_web$adjusted_p_value,
  p_value_r = go_mf_r$p_value
)


# Reshape the data to long format
comparison_data_long <- comparison_data_go_mf %>%
  pivot_longer(cols = starts_with("p_value"), 
               names_to = "source", 
               values_to = "p_value")

# Create the bar plot with -log10 transformed p-values
print(ggplot(comparison_data_long, aes(x = term_name, y = -log10(p_value), fill = source)) +
  geom_bar(stat = "identity", position = "dodge", width = 0.7) +  # Side-by-side bars
  labs(title = "Adjusted P value comparison for GO:MF enrichments",
       x = "Term Name",
       y = "-log10(P-value)") +
  scale_fill_manual(values = c("p_value_web" = "#ff9900", "p_value_r" = "#3366cc")) +  # Custom colors
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) )  

Great! We know we applied the same parameters, and used the same input gene lists. The identical results must mean that gprofiler2 is using the same database version as g:Profiler web.

Saving versions and session details

Let’s check: yesterday when you ran ORA on the web, hopefully you saved your ‘query parameters’ as well as your results.

From my run, I can see version as ‘e111_eg58_p18_f463989d’.

Let’s report the g:Profiler database version used in our analysis:

paste0("g:Profiler database version: ", ora$meta$version)
## [1] "g:Profiler database version: e111_eg58_p18_f463989d"
paste0("gprofiler2 package version: ", packageVersion("gprofiler2"))
## [1] "gprofiler2 package version: 0.2.3"

We can also capture the version of R and other session details including all loaded packages and versions with the sessionInfo() function:

sessionInfo()
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.5 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_AU.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_AU.UTF-8        LC_COLLATE=en_AU.UTF-8    
##  [5] LC_MONETARY=en_AU.UTF-8    LC_MESSAGES=en_AU.UTF-8   
##  [7] LC_PAPER=en_AU.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] tidyr_1.3.1      ggplot2_3.5.1    gprofiler2_0.2.3 dplyr_1.1.4     
## [5] readr_2.1.5     
## 
## loaded via a namespace (and not attached):
##  [1] plotly_4.10.4     sass_0.4.9        utf8_1.2.4        generics_0.1.3   
##  [5] bitops_1.0-7      hms_1.1.3         digest_0.6.35     magrittr_2.0.3   
##  [9] evaluate_0.23     grid_4.4.2        fastmap_1.1.1     jsonlite_1.8.8   
## [13] gridExtra_2.3     promises_1.3.0    httr_1.4.7        purrr_1.0.2      
## [17] fansi_1.0.6       crosstalk_1.2.1   viridisLite_0.4.2 scales_1.3.0     
## [21] textshaping_0.3.7 lazyeval_0.2.2    jquerylib_0.1.4   shiny_1.8.1.1    
## [25] cli_3.6.2         rlang_1.1.3       crayon_1.5.2      bit64_4.0.5      
## [29] munsell_0.5.1     withr_3.0.0       cachem_1.0.8      yaml_2.3.8       
## [33] tools_4.4.2       parallel_4.4.2    tzdb_0.4.0        colorspace_2.1-0 
## [37] httpuv_1.6.15     mime_0.12         vctrs_0.6.5       R6_2.5.1         
## [41] lifecycle_1.0.4   htmlwidgets_1.6.4 bit_4.0.5         vroom_1.6.5      
## [45] ragg_1.3.1        pkgconfig_2.0.3   later_1.3.2       pillar_1.9.0     
## [49] bslib_0.7.0       gtable_0.3.5      Rcpp_1.0.12       data.table_1.15.4
## [53] glue_1.7.0        systemfonts_1.0.6 highr_0.10        xfun_0.43        
## [57] tibble_3.2.1      tidyselect_1.2.1  rstudioapi_0.16.0 knitr_1.46       
## [61] farver_2.1.2      xtable_1.8-4      htmltools_0.5.8.1 labeling_0.4.3   
## [65] rmarkdown_2.26    compiler_4.4.2    RCurl_1.98-1.14

And finally, the RStudio version. Typically, we would simply run RStudio.Version() to print the version details. However, when we knit this document to HTML, the output is not available and will cause an error. So to make sure our version details are saved to our static record of the work, we will save to a file, then print the file contents back into the notebook.

# Get RStudio version information
rstudio_info <- RStudio.Version()

# Convert the version information to a string
rstudio_version_str <- paste(
  "RStudio Version Information:\n",
  "Version: ", rstudio_info$version, "\n",
  "Release Name: ", rstudio_info$release_name, "\n",
  "Long Version: ", rstudio_info$long_version, "\n",
  "Mode: ", rstudio_info$mode, "\n",
  "Citation: ", rstudio_info$citation,
  sep = ""
)

# Write the output to a text file
writeLines(rstudio_version_str, "rstudio_version.txt")
# Read the saved version information from the file
rstudio_version_text <- readLines("rstudio_version.txt")

# Print the version information to the document
rstudio_version_text
## [1] "RStudio Version Information:"                                                                                                                                                                                                                                                                           
## [2] "Version: 2023.6.1.524"                                                                                                                                                                                                                                                                                  
## [3] "Release Name: Mountain Hydrangea"                                                                                                                                                                                                                                                                       
## [4] "Long Version: 2023.06.1+524"                                                                                                                                                                                                                                                                            
## [5] "Mode: server"                                                                                                                                                                                                                                                                                           
## [6] "Citation: list(title = \"RStudio: Integrated Development Environment for R\", author = list(list(given = \"Posit team\", family = NULL, role = NULL, email = NULL, comment = NULL)), organization = \"Posit Software, PBC\", address = \"Boston, MA\", year = \"2023\", url = \"http://www.posit.co/\")"

Knit workbook to HTML

The last task is to knit the notebook. Our notebook is editable, and can be changed. Deleting code deletes the output, so we could lose valuable details. If we knit the notebook to HTML, we have a permanent static copy of the work.

On the editor pane toolbar, under ‘Preview’, select ‘Knit to HTML’.

If you have already run ‘Preview’, you will see ‘Knit’ instead of ‘Preview’.